{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# **Random Forest**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# What is Random Forest?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Random Forest is a popular machine learning algorithm that belongs to the family of ensemble learning methods. It builds multiple decision trees during training and outputs the mode of the trees' predicted classes (classification) or the mean of their predictions (regression).\n",
    "\n",
    "Here are some key points about Random Forest:\n",
    "\n",
    "1. **Ensemble Method**: Random Forest is an ensemble of decision trees, usually trained with the \"bagging\" method. The general idea of bagging is that combining many models trained on different samples gives a more accurate and stable result than any single model.\n",
    "\n",
    "2. **Randomness**: To reduce overfitting, randomness is introduced into the learning process, which creates variation between the trees. This is done in two ways:\n",
    "   - Each tree in the ensemble is built from a sample drawn with replacement (i.e., a bootstrap sample) from the training set.\n",
    "   - When splitting a node during the construction of the tree, the split that is chosen is the best split among a random subset of the features.\n",
    "\n",
    "3. **Prediction**: For a classification problem, the output of the Random Forest model is the class selected by most trees (majority vote). For a regression problem, it is the average of the outputs of the trees.\n",
    "\n",
    "Random Forests are a powerful and widely used machine learning algorithm that provides robustness and accuracy in many scenarios. They are less prone to overfitting than a single decision tree and can work with large datasets and high-dimensional feature spaces."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Real-Life Analogy of Random Forest..."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Imagine you're trying to decide what movie to watch tonight. You have several ways to make this decision:\n",
    "\n",
    "1. **Ask a friend**: You could ask a friend who knows your movie preferences well. This is like using a single decision tree. Your friend knows you well (the tree is well-fitted to the training data), but their recommendation might be overly influenced by the movies you've both watched recently (the tree is overfitting).\n",
    "\n",
    "2. **Ask a group of friends independently**: You could ask a group of friends independently, and watch the movie that the majority of them recommend. Each friend will make their recommendation based on their understanding of your movie preferences. Some friends may give more weight to your preference for action movies, while others may focus more on the director of the movie or the actors. This is like a Random Forest. Each friend forms a \"tree\" in the \"forest\", and the final decision is made based on the majority vote.\n",
    "\n",
    "In this analogy, each friend in the group is a decision tree, and the group of friends is the random forest. Each friend makes a decision based on a subset of your preferences (a subset of the total \"features\" available), and the final decision is a democratic one, based on the majority vote. This process reduces the risk of overfitting (relying too much on one friend's opinion), because individual quirks tend to cancel out when many independent opinions are combined."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Working of Random Forest Algorithm?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The Random Forest algorithm works in the following steps:\n",
    "\n",
    "1. **Bootstrap Dataset**: Random Forest starts by drawing N random records from the training set, where N is usually the number of training records. This sampling is done with replacement, meaning the same row can be chosen multiple times. This sample will be used to build a tree.\n",
    "\n",
    "2. **Build Decision Trees**: For each sample, it then constructs a decision tree. But unlike a standard decision tree, each node is split using the best among a subset of predictors randomly chosen at that node. This introduces randomness into the model creation process and helps to prevent overfitting.\n",
    "\n",
    "3. **Repeat the Process**: Steps 1 and 2 are repeated to create a forest of decision trees.\n",
    "\n",
    "4. **Make Predictions**: For a new input, each tree in the forest gives its prediction. In a classification problem, the class that has the majority of votes becomes the model’s prediction. In a regression problem, the average of all the tree outputs is the final output of the model.\n",
    "\n",
    "The key to the success of Random Forest is that the model is not overly reliant on any individual decision tree. By averaging the results of many different trees, it reduces the variance and provides a much more stable and robust prediction.\n",
    "\n",
    "Random Forests also have a built-in method of measuring variable importance. It measures how much the tree nodes that use a particular feature reduce impurity, averaged across all trees in the forest, and it is a useful tool for interpreting the model."
   ]
  },
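  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The steps above can be sketched by hand to make them concrete. This is a minimal illustration, not how any library implements Random Forest internally: it assumes scikit-learn's `DecisionTreeClassifier` as the base learner, the iris dataset, and an arbitrary choice of 25 trees (`n_trees`, `rng`, and `forest_pred` are names made up for this example). It uses hard majority voting as described above, whereas scikit-learn's `RandomForestClassifier` averages the trees' predicted class probabilities."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "from sklearn.datasets import load_iris\n",
    "from sklearn.model_selection import train_test_split\n",
    "from sklearn.tree import DecisionTreeClassifier\n",
    "\n",
    "X, y = load_iris(return_X_y=True)\n",
    "X_train, X_test, y_train, y_test = train_test_split(\n",
    "    X, y, test_size=0.3, random_state=42, stratify=y\n",
    ")\n",
    "\n",
    "rng = np.random.default_rng(42)\n",
    "n_trees = 25  # illustrative value, not a recommendation\n",
    "trees = []\n",
    "\n",
    "for _ in range(n_trees):\n",
    "    # Step 1: bootstrap sample of the training rows (drawn with replacement)\n",
    "    idx = rng.integers(0, len(X_train), size=len(X_train))\n",
    "    # Step 2: a tree that considers a random subset of features at each split\n",
    "    tree = DecisionTreeClassifier(\n",
    "        max_features=\"sqrt\", random_state=int(rng.integers(1_000_000))\n",
    "    )\n",
    "    tree.fit(X_train[idx], y_train[idx])\n",
    "    trees.append(tree)  # Step 3: repeat to grow the forest\n",
    "\n",
    "# Step 4: every tree votes; the most common class wins\n",
    "votes = np.stack([t.predict(X_test) for t in trees])  # shape (n_trees, n_test)\n",
    "forest_pred = np.array([np.bincount(col).argmax() for col in votes.T])\n",
    "manual_accuracy = (forest_pred == y_test).mean()\n",
    "print(\"Manual forest accuracy:\", manual_accuracy)"
   ]
  },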
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Bagging & Boosting?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Random Forest is an ensemble learning method that operates by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or mean prediction (regression) of the individual trees.\n",
    "\n",
    "Random Forest uses the concept of **bagging** (Bootstrap Aggregating), not boosting. Here's how it works:\n",
    "\n",
    "1. **Bagging**: In bagging, multiple subsets of the original dataset are created using bootstrap sampling. Then, a decision tree is fitted on each of these subsets. The final prediction is made by averaging the predictions (regression) or taking a majority vote (classification) from all the decision trees. Bagging helps to reduce variance and overfitting.\n",
    "\n",
    "2. **Random Feature Selection**: In addition to bagging, Random Forest considers only a random subset of features when choosing the split at each node of a tree. This is closely related to the random subspace method, which samples features once per model rather than at each node. The extra randomness decorrelates the trees, which further reduces variance and overfitting.\n",
    "\n",
    "Boosting, on the other hand, is a different ensemble technique where models are trained sequentially, with each new model trained to correct the errors made by the previous ones. In AdaBoost, for example, each model's vote is weighted by its performance, so more accurate models get more weight. Boosting primarily reduces bias and can also reduce variance, but it is not used in Random Forest. Examples of boosting algorithms include AdaBoost and Gradient Boosting."
   ]
  },
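  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To see the three ideas side by side, the sketch below cross-validates plain bagging, a Random Forest, and a boosting model on the iris dataset. It assumes scikit-learn's `BaggingClassifier` (whose default base model is a decision tree), `RandomForestClassifier`, and `GradientBoostingClassifier`; the choice of 50 estimators and 5 folds is arbitrary. On such a small, easy dataset the scores are likely to be close, so treat this as a demonstration of the APIs rather than a benchmark."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.datasets import load_iris\n",
    "from sklearn.ensemble import (\n",
    "    BaggingClassifier,\n",
    "    GradientBoostingClassifier,\n",
    "    RandomForestClassifier,\n",
    ")\n",
    "from sklearn.model_selection import cross_val_score\n",
    "\n",
    "X, y = load_iris(return_X_y=True)\n",
    "\n",
    "models = {\n",
    "    # Bagging: independent trees, each fitted on a bootstrap sample\n",
    "    \"bagging\": BaggingClassifier(n_estimators=50, random_state=0),\n",
    "    # Random Forest: bagging plus a random feature subset at each split\n",
    "    \"random_forest\": RandomForestClassifier(n_estimators=50, random_state=0),\n",
    "    # Boosting: trees are added sequentially, each correcting earlier errors\n",
    "    \"gradient_boosting\": GradientBoostingClassifier(n_estimators=50, random_state=0),\n",
    "}\n",
    "\n",
    "# Mean 5-fold cross-validated accuracy for each ensemble\n",
    "scores = {name: cross_val_score(m, X, y, cv=5).mean() for name, m in models.items()}\n",
    "for name, score in scores.items():\n",
    "    print(f\"{name}: {score:.3f}\")"
   ]
  },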
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Steps Involved in Random Forest Algorithm?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here are the steps involved in the Random Forest algorithm:\n",
    "\n",
    "1. **Select random samples from a given dataset**: This is done with replacement, meaning the same row can be chosen multiple times. Each sample will be used to build a tree.\n",
    "\n",
    "2. **Construct a decision tree for each sample and get a prediction result from each decision tree**: Unlike a standard decision tree, each node in the tree is split using the best among a subset of predictors randomly chosen at that node. This introduces randomness into the model creation process and helps to prevent overfitting.\n",
    "\n",
    "3. **Perform a vote for each predicted result**: For a new input, each tree in the forest gives its own prediction, and these predictions are collected as votes.\n",
    "\n",
    "4. **Select the prediction result with the most votes as the final prediction**: For classification, the mode of all the predictions is returned. For regression, the mean of all the predictions is returned."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Important Features of Random Forest?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Random Forest has several important features that make it a popular choice for machine learning tasks:\n",
    "\n",
    "1. **Robustness to Overfitting**: Due to the randomness introduced in building the individual trees, Random Forests are less likely to overfit the training data compared to individual decision trees.\n",
    "\n",
    "2. **Handling of Large Datasets**: Random Forest can handle large datasets with high dimensionality. It can handle thousands of input variables and identify the most significant ones.\n",
    "\n",
    "3. **Versatility**: It can be used for both regression and classification tasks, and it can also handle multi-output problems.\n",
    "\n",
    "4. **Feature Importance**: Random Forests provide an importance score for each feature, allowing for feature selection and interpretability.\n",
    "\n",
    "5. **Out-of-Bag Error Estimation**: Because each tree is trained on a bootstrap sample, about one-third of the training rows (roughly 37%) are left out of each tree. This out-of-bag data can be used to estimate the model's performance without a separate validation set.\n",
    "\n",
    "6. **Parallelizable**: The process of building trees is easily parallelizable as each tree is built independently of the others.\n",
    "\n",
    "7. **Missing Values Handling**: How missing values are handled depends on the implementation. Some implementations provide imputation helpers (for example, `rfImpute` and `na.roughfix` in R's randomForest package), while others expect you to impute missing values before training.\n",
    "\n",
    "8. **Non-Parametric**: Random Forest is a non-parametric method, which means that it makes no assumptions about the functional form of the mapping from inputs to output. This is an advantage for datasets where the relationship between inputs and output is complex and non-linear.\n",
    "\n",
    "Remember, while Random Forest has these advantages, it also has some disadvantages, such as being a black-box model with limited interpretability compared to a single decision tree, and being slower to train and predict than simpler models like linear models."
   ]
  },
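  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Two of the features above, the out-of-bag estimate and feature importance, are easy to check numerically. The sketch below first computes the chance that a row is left out of a bootstrap sample of size n, which is (1 - 1/n)^n and approaches 1/e (about 0.368) as n grows; this is where the \"about one-third\" figure comes from. It then fits scikit-learn's `RandomForestClassifier` with `oob_score=True` on the iris dataset and reads the `oob_score_` and `feature_importances_` attributes. The 200 trees and the seed are arbitrary choices for illustration."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "from sklearn.datasets import load_iris\n",
    "from sklearn.ensemble import RandomForestClassifier\n",
    "\n",
    "# Chance that a given row is left out of a bootstrap sample of size n\n",
    "n = 150\n",
    "p_out = (1 - 1 / n) ** n\n",
    "print(f\"P(row is out-of-bag) for n={n}: {p_out:.3f} (limit 1/e = {np.exp(-1):.3f})\")\n",
    "\n",
    "iris = load_iris()\n",
    "forest = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)\n",
    "forest.fit(iris.data, iris.target)\n",
    "\n",
    "# Accuracy measured only on rows that each tree did not see during training\n",
    "print(\"OOB accuracy:\", forest.oob_score_)\n",
    "\n",
    "# Impurity-based importances; they sum to 1 across features\n",
    "for name, importance in zip(iris.feature_names, forest.feature_importances_):\n",
    "    print(f\"{name}: {importance:.3f}\")"
   ]
  },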
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Difference Between Decision Tree and Random Forest?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here's a comparison between Decision Trees and Random Forests in a tabular format:\n",
    "\n",
    "| Feature | Decision Tree | Random Forest |\n",
    "| --- | --- | --- |\n",
    "| **Basic** | Single tree | Ensemble of multiple trees |\n",
    "| **Overfitting** | Prone to overfitting | Less prone due to averaging of multiple trees |\n",
    "| **Performance** | Often lower accuracy on complex datasets | Usually higher accuracy due to the ensemble |\n",
    "| **Training Speed** | Faster | Slower due to building multiple trees |\n",
    "| **Prediction Speed** | Faster | Slower due to aggregating results from multiple trees |\n",
    "| **Interpretability** | High (easy to visualize and understand) | Lower (hard to visualize many trees) |\n",
    "| **Feature Selection** | Considers all features when splitting a node | Considers a random subset of features when splitting a node |\n",
    "| **Handling Unseen Data** | Often generalizes less well | Usually generalizes better due to averaging |\n",
    "| **Variance** | High variance | Lower variance due to averaging |\n",
    "\n",
    "Remember, the choice between a Decision Tree and Random Forest often depends on the specific problem and the computational resources available. Random Forests generally perform better, but they require more computational resources and are less interpretable."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Important Hyperparameters in Random Forest?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Random Forest has several important hyperparameters that control its behavior (the names below follow scikit-learn):\n",
    "\n",
    "1. **n_estimators**: The number of trees in the forest, built before taking the majority vote or the average of the predictions. Higher values generally make the predictions more stable, but also slow down training and prediction.\n",
    "\n",
    "2. **max_features**: The maximum number of features considered when looking for the best split at each node. Options include \"sqrt\", \"log2\", None (all features), an integer, or a float (a fraction of the features); older releases also accepted \"auto\". For classification, sqrt(number of features) is a good starting point.\n",
    "\n",
    "3. **max_depth**: The maximum number of levels in each decision tree. You can set it to an integer or leave it as None for unlimited depth. Limiting the depth can be used to control overfitting.\n",
    "\n",
    "4. **min_samples_split**: The minimum number of samples a node must contain before it can be split. Higher values prevent the model from learning relations that are highly specific to the particular sample selected for a tree.\n",
    "\n",
    "5. **min_samples_leaf**: The minimum number of samples allowed in a leaf node. Higher values reduce overfitting.\n",
    "\n",
    "6. **bootstrap**: A boolean value indicating whether bootstrap samples are used when building trees. If False, the whole dataset is used to build each tree.\n",
    "\n",
    "7. **oob_score**: Also a boolean; it indicates whether to use out-of-bag samples to estimate the generalization score. It requires bootstrap to be True.\n",
    "\n",
    "8. **n_jobs**: The number of jobs to run in parallel for both fit and predict. If set to -1, all available processors are used.\n",
    "\n",
    "9. **random_state**: Controls the randomness of both the bootstrap sampling and the feature subsets considered at each split. If it is fixed, repeated runs on the same data produce the same model.\n",
    "\n",
    "10. **class_weight**: This parameter allows you to specify weights for the classes. This is useful if your classes are imbalanced.\n",
    "\n",
    "Remember, tuning these hyperparameters can significantly improve the performance of the model, but tuning them against the test set can lead to overfitting. It's usually a good idea to use a cross-validated search, such as GridSearchCV or RandomizedSearchCV, to find good values for these hyperparameters."
   ]
  },
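  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a rough illustration of the cross-validated search mentioned above, the sketch below runs scikit-learn's `GridSearchCV` over a few of these hyperparameters on the iris dataset. The grid values are assumptions chosen only to keep the example fast (8 combinations, 5 folds each); a real search would usually cover a wider range and use a held-out test set for the final evaluation."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.datasets import load_iris\n",
    "from sklearn.ensemble import RandomForestClassifier\n",
    "from sklearn.model_selection import GridSearchCV\n",
    "\n",
    "X, y = load_iris(return_X_y=True)\n",
    "\n",
    "# A deliberately small, illustrative grid; real searches are usually wider\n",
    "param_grid = {\n",
    "    \"n_estimators\": [50, 100],\n",
    "    \"max_depth\": [None, 3],\n",
    "    \"min_samples_leaf\": [1, 3],\n",
    "}\n",
    "\n",
    "# Each combination is scored with 5-fold cross-validation\n",
    "search = GridSearchCV(\n",
    "    RandomForestClassifier(random_state=42),\n",
    "    param_grid,\n",
    "    cv=5,\n",
    "    scoring=\"accuracy\",\n",
    ")\n",
    "search.fit(X, y)\n",
    "\n",
    "print(\"Best parameters:\", search.best_params_)\n",
    "print(\"Best cross-validated accuracy:\", round(search.best_score_, 3))"
   ]
  },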
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Coding in Python – Random Forest?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here's a basic example of how to use the Random Forest algorithm for a classification problem in Python using the scikit-learn (`sklearn`) library. We'll use the iris dataset, a multi-class classification problem that ships with scikit-learn.\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Accuracy:  1.0\n"
     ]
    }
   ],
   "source": [
    "from sklearn.ensemble import RandomForestClassifier\n",
    "from sklearn.datasets import load_iris\n",
    "from sklearn.model_selection import train_test_split\n",
    "from sklearn.metrics import accuracy_score\n",
    "\n",
    "# Load the iris dataset\n",
    "iris = load_iris()\n",
    "X = iris.data\n",
    "y = iris.target\n",
    "\n",
    "# Split the data into train and test sets\n",
    "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)\n",
    "\n",
    "# Create the model with 100 trees; random_state makes the run reproducible\n",
    "model = RandomForestClassifier(n_estimators=100,\n",
    "                               bootstrap=True,\n",
    "                               max_features='sqrt',\n",
    "                               random_state=42)\n",
    "\n",
    "# Fit on training data\n",
    "model.fit(X_train, y_train)\n",
    "\n",
    "# Predict the test set\n",
    "predictions = model.predict(X_test)\n",
    "\n",
    "# Calculate the accuracy score\n",
    "accuracy = accuracy_score(y_test, predictions)\n",
    "print(\"Accuracy: \", accuracy)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "\n",
    "This code first loads the iris dataset, then splits it into a training set and a test set. A Random Forest model is created with 100 trees and fitted on the training data. The model is then used to predict the classes of the test set, and the accuracy of these predictions is printed out."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Coding in R – Random Forest?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here's a basic example of how to use the Random Forest algorithm for a classification problem in R using the randomForest package. We'll use the iris dataset, which is a multi-class classification problem, built into R for this example.\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Installing package into ‘/home/blackheart/R/x86_64-pc-linux-gnu-library/4.1’\n",
      "(as ‘lib’ is unspecified)\n",
      "\n",
      "randomForest 4.7-1.1\n",
      "\n",
      "Type rfNews() to see new features/changes/bug fixes.\n",
      "\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "Call:\n",
      " randomForest(formula = Species ~ ., data = iris, ntree = 100,      importance = TRUE) \n",
      "               Type of random forest: classification\n",
      "                     Number of trees: 100\n",
      "No. of variables tried at each split: 2\n",
      "\n",
      "        OOB estimate of  error rate: 4.67%\n",
      "Confusion matrix:\n",
      "           setosa versicolor virginica class.error\n",
      "setosa         50          0         0        0.00\n",
      "versicolor      0         47         3        0.06\n",
      "virginica       0          4        46        0.08\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<table class=\"dataframe\">\n",
       "<caption>A matrix: 4 × 5 of type dbl</caption>\n",
       "<thead>\n",
       "\t<tr><th></th><th scope=col>setosa</th><th scope=col>versicolor</th><th scope=col>virginica</th><th scope=col>MeanDecreaseAccuracy</th><th scope=col>MeanDecreaseGini</th></tr>\n",
       "</thead>\n",
       "<tbody>\n",
       "\t<tr><th scope=row>Sepal.Length</th><td> 2.313169</td><td> 1.9048166</td><td> 3.275294</td><td> 4.299426</td><td> 9.541714</td></tr>\n",
       "\t<tr><th scope=row>Sepal.Width</th><td> 1.931173</td><td>-0.6401191</td><td> 2.173373</td><td> 1.834349</td><td> 2.033635</td></tr>\n",
       "\t<tr><th scope=row>Petal.Length</th><td>10.176809</td><td>16.8581929</td><td>13.539843</td><td>16.298141</td><td>46.032251</td></tr>\n",
       "\t<tr><th scope=row>Petal.Width</th><td>10.139444</td><td>13.1397106</td><td>14.113327</td><td>14.885764</td><td>41.657799</td></tr>\n",
       "</tbody>\n",
       "</table>\n"
      ],
      "text/latex": [
       "A matrix: 4 × 5 of type dbl\n",
       "\\begin{tabular}{r|lllll}\n",
       "  & setosa & versicolor & virginica & MeanDecreaseAccuracy & MeanDecreaseGini\\\\\n",
       "\\hline\n",
       "\tSepal.Length &  2.313169 &  1.9048166 &  3.275294 &  4.299426 &  9.541714\\\\\n",
       "\tSepal.Width &  1.931173 & -0.6401191 &  2.173373 &  1.834349 &  2.033635\\\\\n",
       "\tPetal.Length & 10.176809 & 16.8581929 & 13.539843 & 16.298141 & 46.032251\\\\\n",
       "\tPetal.Width & 10.139444 & 13.1397106 & 14.113327 & 14.885764 & 41.657799\\\\\n",
       "\\end{tabular}\n"
      ],
      "text/markdown": [
       "\n",
       "A matrix: 4 × 5 of type dbl\n",
       "\n",
       "| <!--/--> | setosa | versicolor | virginica | MeanDecreaseAccuracy | MeanDecreaseGini |\n",
       "|---|---|---|---|---|---|\n",
       "| Sepal.Length |  2.313169 |  1.9048166 |  3.275294 |  4.299426 |  9.541714 |\n",
       "| Sepal.Width |  1.931173 | -0.6401191 |  2.173373 |  1.834349 |  2.033635 |\n",
       "| Petal.Length | 10.176809 | 16.8581929 | 13.539843 | 16.298141 | 46.032251 |\n",
       "| Petal.Width | 10.139444 | 13.1397106 | 14.113327 | 14.885764 | 41.657799 |\n",
       "\n"
      ],
      "text/plain": [
       "             setosa    versicolor virginica MeanDecreaseAccuracy\n",
       "Sepal.Length  2.313169  1.9048166  3.275294  4.299426           \n",
       "Sepal.Width   1.931173 -0.6401191  2.173373  1.834349           \n",
       "Petal.Length 10.176809 16.8581929 13.539843 16.298141           \n",
       "Petal.Width  10.139444 13.1397106 14.113327 14.885764           \n",
       "             MeanDecreaseGini\n",
       "Sepal.Length  9.541714       \n",
       "Sepal.Width   2.033635       \n",
       "Petal.Length 46.032251       \n",
       "Petal.Width  41.657799       "
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[1] \"Accuracy:  1\"\n"
     ]
    }
   ],
   "source": [
    "# Install the randomForest package if it is missing, then load it\n",
    "if (!requireNamespace(\"randomForest\", quietly = TRUE)) install.packages(\"randomForest\")\n",
    "library(randomForest)\n",
    "\n",
    "# Load the iris dataset\n",
    "data(iris)\n",
    "\n",
    "# Create a random forest model\n",
    "set.seed(42)  # for reproducibility\n",
    "iris.rf <- randomForest(Species ~ ., data=iris, ntree=100, importance=TRUE)\n",
    "\n",
    "# Print the model summary\n",
    "print(iris.rf)\n",
    "\n",
    "# Get importance of each feature\n",
    "importance(iris.rf)\n",
    "\n",
    "# Predict using the model\n",
    "iris.pred <- predict(iris.rf, iris)\n",
    "\n",
    "# Check the accuracy\n",
    "accuracy <- sum(iris.pred == iris$Species) / nrow(iris)\n",
    "print(paste(\"Accuracy: \", accuracy))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "\n",
    "This code first loads the iris dataset, then creates a Random Forest model with 100 trees fitted on the entire dataset. The model summary and feature importance are printed out. The model is then used to predict the classes of the same dataset, and the accuracy of these predictions is printed out. Because the model is evaluated on the data it was trained on, the accuracy of 1 is optimistic; the out-of-bag (OOB) error estimate of 4.67% in the model summary is a more realistic measure. In practice, you should split your data into training and testing sets."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# **Thank You!**"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
